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Systems fail, and sometimes no one is around to reset them before something worse 
happens. That's why watchdog timers matter. 

Launched in January 1994, NASA's Clementine spacecraft spent two successful months mapping 
the moon before leaving lunar orbit to head towards near-Earth asteroid, Geographos. 

A dual-processor Honeywell 1750 subsystem handled telemetry and various spacecraft functions. 
Though the 1750 could control Clementine's thrusters, it did so only in emergency situations; 
routine thruster operations were under ground control. 

On May 7, the 1750 experienced a floating-point exception. This wasn't unusual; some 3,000 prior 
exceptions had been detected and handled properly. But immediately after the May 7 event, 
downlinked data started varying wildly and nonsensically. Then the data froze. Controllers spent 20 
minutes trying to bring the system back to life by sending software resets to the 1750; all were 
ignored. A hardware reset command finally brought Clementine back on-line. 

Alive, yes, even communicating with the ground, but with virtually no fuel left. 

The evidence suggests that the 1750 locked up, probably due to a software crash. While hung, the 
processor turned on one or more thrusters, dumping fuel and setting the spacecraft spinning at 80 
RPM. In other words, it appears the code ran wild, firing thrusters it should never have enabled. 
They kept firing until the tanks ran nearly dry and the hardware reset closed the valves. The 
mission to Geographos was abandoned. 

Designers had worried about this sort of problem and implemented a software thruster timeout. 
That failed, of course, when the firmware hung. 

The 1750's built-in watchdog timer (WDT) hardware was not used, over the objections of the lead 
software designer. With no automatic reset, the mission's success rested in the abilities of the 
controllers on Earth to detect problems quickly. For the lack of a few lines of watchdog code, the 
mission was lost. 

Though such a fuel dump had never occurred on Clementine before the May 7 meltdown, hardware 
resets from the ground had been required roughly 16 times. One might also wonder why some 
3,000 previous floating-point exceptions were part of the mission's normal firmware profile. 

Not surprisingly, the software team wished they had indeed used the watchdog and not 
implemented the thruster timeout in firmware. They also noted, though, that a normal, simple 
watchdog might not have been robust enough to catch the failure mode. 
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Contrast this with Pathfinder, a mission whose software also famously hung, but which was saved 
by a reliable watchdog. The software team found and fixed the bug, uploading new code to a target 
system 40 million miles away, enabling an amazing scientific mission on Mars. 

Watchdogs are our fail-safe, our last line of defense, an option taken only when all else fails, right? 
Actually, these missions and many others suggest to me that WDTs are not emergency outs, but 
integral parts of our systems. The WDT is as important as main() or the runtime library; it's an 
asset that is likely to be used, and maybe used a lot. 

Space is a hostile environment, of course, with high intensity radiation fields, thermal extremes, 
and vibrations we'd never see on Earth. Do we have these worries when designing Earth-bound 
systems? 

Maybe we do. Intel revealed that the McKinley processor's ultra-fine design rules and huge 
transistor budget means cosmic rays can flip on-chip bits. The Itanium 2 processor, also sporting 
an astronomical transistor budget and small geometry, includes an on-board system management 
unit to handle transient hardware failures. The hardware ain't what it used to be, even if our 
software could be perfect. 

Reality check 

But too much (all?) firmware is not perfect. Consider this unfortunately true story from long-time 
reader Ed VanderPloeg: 

The world has reached a new embedded software milestone: I had to reboot my hood fan. That's 
right, the range exhaust fan in the kitchen. It's a simple model from a popular North American 
company. It has six buttons on the front: three for low, medium, and nigh fan speeds, and three 
more for low, medium, and high light levels. Press a button once and the hood fan does what the 
button says. Press the same button again and the fan or lights turn off. That's it; nothing fancy. 
But it needed rebooting via the breaker panel. 

Apparently the thing has a microprocessor to control the light levels and fan speeds, and it also has 
a temperature sensor to automatically switch the fan to high speed if the temperature exceeds 
some fixed threshold. Well, one day we were cooking dinner as usual, steaming a pot of potatoes, 
and suddenly the fan kicked into high speed and the lights started flashing. 'Hmm, flaky sensor or 
buggy sensor software, ' I thought to myself. 

The food happened to be done, so I turned off the stove and tried to turn off the fan, but, I 
suppose, it wanted things to cool off first. Fine. After ten minutes or so the fan and lights turned off 
on their own. I then went to turn on the lights, but instead they flashed continuously, with the flash 
rate depending on the brightness level I selected. 

Just for fun I tried turning on the fan too, but all three speed buttons produced only high speed. 
'What 'smart' feature is this?' I wondered to myself. 'Maybe it needs to rest a while. ' I turned off 
the fan and lights and went back to finish my dinner. For the rest of the evening the fan and lights 
would turn on and off at random intervals and random levels, so I gave up on the idea that it 
would self-correct. With a heavy heart I went over to the breaker panel and flipped the hood fan 
breaker to and fro. The hood fan was once again well-behaved. 

What's the embedded world coming to? Will programmers and companies everywhere realize the 
cost of their mistakes and clean up their act? Or will the entire world become accustomed to 
occasionally rebooting everything they own? Will expensive embedded devices come with a 'reset' 
button advertised as a feature? Will President Bush threaten war on programmers, unless U.N. 
code inspections are permitted? Or worst of all, will programmer jokes become as common and 
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ruthless as lawyer jokes? I wish I knew the answer. I can only hope for the best, but I fear the 
worst. 

At last year's Embedded Systems Conference, a participant at a seminar I led admitted that his 
consumer products company could care less about the correctness of firmware. Reboot-who cares? 
Customers are used to this, trained by decades of desktop computer disappointments. Hit the reset 
switch, cycle power, remove the batteries for 15 minutes; even preteens know the tricks of coping 
with legions of embedded devices. 

Crummy firmware is the norm, but it's totally unacceptable. Shipping a defective product is 
begging for torts. So far, the embedded world has been mostly immune from predatory lawyers, 
but that Brigadoon-like isolation is unlikely to continue. More importantly, it's simply unethical to 
produce junk. 

But it's hard, even impossible, to produce perfect firmware. We must strive to make the code 
correct, but also design our systems to cleanly handle failures. In other words, a healthy dose of 
paranoia leads to better systems. 

I was astonished to find that in almost 13 years of writing this column I've somehow managed to 
neglect watchdog timers. So here's my take, which is quite different from that espoused by most 
pundits. 

Well designed watchdog timers fire off every day, quietly saving systems and lives without the 
esteem offered to human heroes. 

Poorly designed WDTs also fire off a lot, sometimes saving things, sometimes making them worse. 
A simple-minded watchdog implemented in a non-safety critical system won't threaten health or 
lives, but can result in systems that hang and do strange things that annoy customers. No business 
can tolerate unhappy customers, so unless your code is perfect (whose is?), it's best in all but the 
most cost-sensitive apps to build a really great WDT. 

An effective WDT is far more than a timer that drives reset. Such simplicity might have saved 
Clementine, but would it fire when the code tumbles into a really weird mode like the one 
experienced by Ed's hood fan? 

What follows is a look at the current state of the art. Next month, I'll boil all of this down to 
concrete design recommendations. 

Internal watchdogs 

Virtually all highly integrated embedded processors include a built-in watchdog. Unfortunately, 
most of these are braindead. 

Toshiba's TMP96141AF is part of their TLCS-900 family of microprocessors, which offers a wide 
range of extremely versatile on-chip peripherals. All have pretty much the same watchdog circuit. 
As the data sheet says, "The TMP96141AF is containing watchdog timer of Runaway detecting." 

Ahem. And I thought the days of Jinglish were over. Anyway, the part generates a non-maskable 
interrupt (NMI) when the watchdog times out, which, as we'll see next month, is either a very, 
very bad idea or a wonderfully clever one. It's clever only if the system produces an NMI, waits a 
while, and only then asserts reset, which the Toshiba part unhappily cannot do. Reset and NMI are 
synchronous. 
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Motorola's widely-used 68332 variant of their CPU32 family (like most of these 68k embedded 
parts) also includes a watchdog. It's a simple-minded thing meant for low-reliability applications 
only. 

Unlike a lot of WDTs, user code must write two different values (0x55 and OxAA) to the WDT 
control register to ensure the device does not time out. This is a good thing; it limits the chances of 
rogue software accidentally issuing the command needed to appease the watchdog. I'm not thrilled 
that any amount of time may elapse between the two writes (up to the timeout period). Requiring 
two back-to-back writes would further reduce the chances of random watchdog tickles, though one 
would then have to ensure no interrupt could preempt the paired writes. 

Another problem is that 0x55 and OxAA patterns are often used in RAM tests; since the 68k I/O 
registers are memory mapped, a runaway RAM test could keep the device from resetting. 

The 68332's WDT drives reset, not some exception-handling interrupt or NMI. This makes a lot of 
sense, since any software failure that causes the stack pointer to go odd will crash the code, and a 
further exception-handling interrupt of any sort would drive the part into a "double bus fault." The 
hardware is such that it takes a reset to exit this condition. 

Motorola's popular Coldfire parts have a similar WDT. The MCF5204, for instance, will let the code 
write to the WDT control registers only once. Cool! Crashing code, which might do all sorts of silly 
things, cannot reprogram the protective mechanism. However, it's possible to change the reset 
interrupt vector at any time, pretty much invalidating the clever write-once design. 

As with the CPU32 parts, an 0x55/0xAA sequence keeps the WDT from timing out, and back-to- 
back writes aren't required. The Coldfire datasheet touts this as an advantage since it can handle 
interrupts between the two tickle instructions, but I'd prefer a smaller window. The Coldfire has a 
fault-on-fault condition much like the CPU32's double-bus fault, so reset is the only option when 
the WDT fires-which is a good thing. 

Unfortunately, there's no external indication that the WDT timed out, perhaps to save pins. That 
means your hardware/software must be designed so that at a warm boot, the code can issue a 
from-the-ground-up reset to every peripheral. 

Philip's XA processors require two sequential writes of 0xA5 and 0x5A to the WDT. But, like the 
Coldfire, there's no external indication of a timeout, and it appears the watchdog reset isn't even a 
complete CPU restart; the docs suggest it's just a reload of the program counter. Yikes-what if the 
processor's internal states were in disarray from code running amok or a hardware glitch? 

Dallas Semiconductor's DS80C320, an 8051 variant, has a powerful WDT circuit that generates a 
special watchdog interrupt 128 cycles before automatically, and irrevocably, performing a hardware 
reset. This gives your code a chance to safe the system and leave debugging breadcrumbs behind 
before a complete system restart begins. Pretty cool. 

External watchdogs 

Many of the supervisory chips we buy to manage a processor's reset line include built-in WDTs. 

Texas Instrument's UCC3946 is one of many power supervisor parts that do an excellent job of 
driving reset only when Vcc is legal. In a small 8-pin surface mount package, it eats practically no 
real estate on a printed circuit board. It's not connected to the CPU's clock, so the WDT will output 
a reset to the hardware fail-safe mechanisms even if there's a crystal failure. But it's too darn 
simple: to avoid a timeout, just wiggle the input bit once in a while. Crashed code might do this in 
any number of ways. 
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TI isn't the only purveyor of simplistic WDTs. Maxim's MAX823 and many other versions are 
similarly flawed. The catalogs of a dozen other vendors list equally dull and ineffective watchdogs. 

But both TI and Maxim offer more sophisticated devices. Consider TI's TPS3813 and Maxim's 
MAX6323. Both are "Window Watchdogs." Unlike the internal versions described above, which 
avoid timeouts using two different data writes (like a 0x55 and then OxAA), these require tickling 
within certain time bands. Toggle the WDT input too slowly, too fast, or not at all, and a timeout 
will occur. That greatly reduces the chances that a program run amok will create the precise timing 
needed to satisfy the watchdog. Since a crashed program will likely speed up or bog down if it does 
anything at all, errant strobing of the tickle bit will almost certainly be outside the time band 
required. 

The subject of watchdog timers is far too big to cover in a single column, so I'll carry on next 
month. But for now, remember that since software crashes are common, we're foolish not to equip 
our devices with some sort of recovery mechanism. And it's unwise to rely on a recovery strategy 
that's easily defeated by runaway code that could be doing virtually anything. When systems fail, 
they often do it in the worst possible way. 

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts 
seminars on embedded systems and helps companies with their embedded challenges. Contact him 
at jack@ganssle.com . 
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